Abstract


Many outcomes of interest in the social and health sciences, as well as in modern 
applications in computational social science and experimentation on social media platforms,
are ordinal and do not have a meaningful scale. Causal analyses that leverage
this type of data, termed ordinal non-numeric, require careful treatment, as much of 
the classical potential outcomes literature is concerned with estimation and hypothesis 
testing for outcomes whose relative magnitudes are well defined. Here, we propose a 
class of finite population causal estimands that depend on conditional distributions of 
the potential outcomes, and provide an interpretable summary of causal effects when 
no scale is available. We formulate a relaxation of the Fisherian sharp null hypothesis
of constant effect that accommodates the scale-free nature of ordinal non-numeric
data. We develop a Bayesian procedure to estimate the proposed causal estimands 
that leverages the rank likelihood. We illustrate these methods with an application to 
educational outcomes in the General Social Survey. 


Keywords: Ordinal non-numeric data, potential outcomes, Fisher’s exact test, 
estimation of causal effects, rank likelihood. 
1 Introduction


Outcomes in the social and economic sciences are frequently ordered; however, the scale or
magnitude of the outcomes is not always available. That is, although outcome $Y_i$ might
belong to category one, labeled “low”, and outcome $Y_{i'}$ to category two, labeled “high”,
the only information about the relative relationship of the two outcomes is that
$Y_i < Y_{i'}$. Data that exhibit such a structure are termed ordinal non-numeric, and although
the categories are frequently represented by integer values, there is no substantive information
in the data about the relative magnitude of the categories. Examples of such data abound
in education research, e.g., the level of education: “high school diploma”, “college”, “masters”,
“PhD” (Dubow et al., 2009), in operations research, in employment records, e.g., for job
seniority data: “staff”, “manager”, “vice president”, “president” (Singh and Pestonjee, 1990),
and in healthcare research, e.g., when the outcome is pain: “none”, “mild”, “severe” (Collins
et al., 1997). In this context, it is frequently of interest to make causal statements about
how some treatment might affect an individual’s category, e.g., whether a drug reduces pain
level or whether a vocational program leads to better job prospects (Mealli et al., 2012).

The first step when attempting any causal analysis is the choice of an estimand, or inferential
target, which is the object of interest. When outcomes are continuous, the most
commonly studied estimand is the average treatment effect (e.g., see Rubin, 1974; Holland,
1986). However, this quantity is not well defined for non-numeric data, because an average
of more than two ordinal values requires a numeric scale. Other measures of centrality might
be of interest; for example, the modal difference between treatment and control describes the
most common number of categories that are changed due to treatment. However, measures
of centrality for non-numeric data do not provide a complete picture because the relative
magnitude between pairs of categories is not well defined. For example, a modal treatment
effect of zero, indicating that most often there is no change due to treatment, might conceal
information that the treatment is effective for the subset of individuals whose potential
outcome under control equals category one, but is ineffective for everyone else. These issues
motivate our development of a class of conditional estimands in Section 2.

The causal inference literature based on the potential outcomes framework (Rubin, 1974)
has focused on a special case of ordinal non-numeric response: the binary outcome. This is
a special case because although the categories are ordered and non-numeric (for example in
medicine: 0 = dead, 1 = alive), there are no relative magnitudes to consider (Rosenbaum and
Rubin, 1983); that is, the average of binary outcomes is simply the proportion. The presence
of more outcome levels has frequently led to model-based causal analysis (Rubin, 1978)
without first defining estimands based on the potential outcomes (Shaikh and Vytlacil, 2005;
Cunha et al., 2007). A tempting advantage of model-based inference for ordinal non-numeric
outcomes is the availability of a continuous latent variable formulation of the outcomes.
Within the context of the potential outcomes framework, a model-based approach assumes
the existence of a continuous potential outcome $Z_i(t)$ and a mapping $g : Z_i(t) \to Y_i(t)$ that
discretizes $Z_i(t)$ into an ordinal non-numeric observation, $Y_i(t)$. We refer to the former as
potential outcomes on the latent scale, or as latent potential outcomes, to distinguish them
from the actual potential outcomes, which are observed on the measurement scale. It is
possible to define estimands for either of these types of potential outcomes, but we show
in Section 2 that the latent variable formulation suffers from undesirable identifiability
issues.

The remainder of this paper is structured as follows. In Section 2 we formally revisit the
potential outcomes framework and describe various interesting estimands. In Section 3 we
revisit the framework for the Fisher exact test of the sharp null hypothesis of no causal effect.
Unlike in the continuous case, the sharp null of constant (non zero) effect is not available
for ordinal non-numeric data and so we develop a novel test for the related null hypothesis
of an effect that changes at most one category. Section 4 describes estimation procedures
for estimands on the observed scale with and without additional assumptions on the joint
distribution of the potential outcomes. In Section 5 we outline two Bayesian procedures
for estimating causal effects on the observed scale, one within the modeling framework of a
standard ordered probit and one based on the rank likelihood. These methods are illustrated
through a practical example based on data from the General Social Survey, in Section 6.
Some concluding remarks follow.


2 Potential outcomes framework and estimands 

In this section we provide a general introduction to the potential outcomes framework for 
causal inference, frequently referred to as the Rubin Causal Model (Rubin, 1974, 1978). A 
detailed history and exposition is available in Rubin (2005). 

The formal potential outcomes framework provides a clear separation between the science,
i.e., the object of inference, and the process by which inference about the science is made.
Brief definitions of specific concepts are as follows. The unit of inference is a physical object
at a particular point in time, e.g., an individual. For a binary treatment $T \in \{0, 1\}$ and $N$
units, a table that codifies the science under the Stable-Unit-Treatment-Value Assumption
(SUTVA, Rubin (1980)) is an $N \times 2$ table where each row equals $(Y_i(0), Y_i(1))$, where $Y_i(0)$
is the potential outcome for unit $i$ under control and $Y_i(1)$ is the potential outcome for unit
$i$ under treatment. A unit level causal effect is a comparison of the potential outcomes for a
given unit. However, we can only observe one of the potential outcomes for each unit; that
is, the treatment status of unit $i$ is either $W_i = 0$ or $W_i = 1$. As such, unit specific causal
effects cannot be observed and must be inferred. This is facilitated by different assumptions
that can be imposed on the marginal and joint distributions of the treatment assignments
$W_i$ and the science table.

One of the most commonly studied causal estimands, both in terms of theory and applications
(Rubin, 1974; Lin et al., 2013), is the average causal effect $E[Y(1) - Y(0)]$. Under
relatively mild conditions (e.g., see Imbens and Rubin, 2015) this quantity can be estimated
in both randomized and observational studies. A fundamental property of this estimand is
that its dependence on unit level causal effects is separable into a dependence on only the
potential outcomes under treatment and only the potential outcomes under control due to
the linearity of the expectation. As we will see below, many estimands that are of interest
when dealing with ordinal non-numeric outcomes do not possess this property. We argue
that this particular type of average causal effect is not meaningful for ordinal non-numeric
outcomes.

For ordinal non-numeric data, one can consider estimands on the observed scale or, if
assuming the existence of a latent variable representation of the science table, on a latent
scale. Recall that on the latent scale we make the assumption that there are latent potential
outcomes $(Z_i(0), Z_i(1))$ that are related to the observed scale via a deterministic function
$g : Z_i(t) \to Y_i(t)$, $t = 0, 1$. The mapping is dependent on the ordering on the observed scale
in the following sense: if there are categories that admit different orderings based on different
criteria then any inference must be conditional on the choice of a specific criterion (e.g., it
takes longer to become an MD than it does to become a lawyer but the starting salary of
lawyers is higher than that of doctors). This suggests that the function $g$ must be conditioned
on the ordering criterion, on the observed scale, to maintain the appropriate ordering of
potential outcomes on the latent scale. An additional complication for causal inference
using the latent potential outcomes is that the function $g$ is rarely known explicitly, making
many estimands on the latent scale difficult to interpret. In what follows we first discuss
estimands of interest on the observed scale, arguing for the use of conditional estimands to
best capture the effect of treatment in ordinal non-numeric data. We then describe estimands
on the latent scale and note that due to the complications associated with the latent scale,
the appropriate estimands for most applications are on the observed scale.

Throughout, the treatment is binary, there are $k$ outcome levels on the observed scale,
and the latent scale is continuous and unbounded on the real line.


2.1 Estimands on the observed scale 

As with the general potential outcomes framework, the complete information about any
causal effect is found in the joint distribution of the potential outcomes; here we ignore
the possible presence of covariates. For the observed scale this is summarized by a $k \times k$
matrix $P$ where $p_{ij} = \Pr(Y(0) = i, Y(1) = j)$. All estimands are functions of this
matrix $P$. For example, the pair of marginal distributions, $(P_0, P_1) = (\Pr(Y(0)), \Pr(Y(1))) =
(P\mathbf{1}, \mathbf{1}^T P)$, is a $2k$ dimensional summary of the joint matrix $P$. If the potential outcomes
are independent, then these marginals encode all the information in the joint distribution.
The high dimensional nature of these estimands reduces their simplicity and so we strive for
lower dimensional summaries that are more easily communicated.
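
As a small numerical illustration of the relationship between the joint matrix and its marginals, the sketch below uses a hypothetical $3 \times 3$ joint distribution (the numbers are invented for illustration and do not come from any data set in this paper). It computes $P_0$ as the row sums and $P_1$ as the column sums, and forms the joint distribution that would hold if the potential outcomes were independent.

```python
import numpy as np

# Hypothetical joint distribution (illustrative values only):
# P[i, j] = Pr(Y(0) = i + 1, Y(1) = j + 1) for k = 3 ordered categories.
P = np.array([
    [0.10, 0.20, 0.00],
    [0.00, 0.15, 0.25],
    [0.00, 0.00, 0.30],
])

ones = np.ones(P.shape[0])
P0 = P @ ones   # row sums: marginal distribution of Y(0)
P1 = ones @ P   # column sums: marginal distribution of Y(1)

# Under independence the joint is the outer product of the marginals,
# so the 2k numbers in (P0, P1) determine all k * k entries.
P_indep = np.outer(P0, P1)

print("P0 =", P0)
print("P1 =", P1)
```

In this example the actual joint matrix differs from `P_indep`, which is the situation in which the marginals alone do not capture the full causal structure.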

One dimensional estimands. The average treatment effect (ATE) mentioned above, a
popular scalar estimand, is not meaningful in the setting of ordinal non-numeric outcomes
because the expectation is not well defined. Other measures of centrality suffer from a
similar degeneracy as they effectively assign a scale to the differences. For example, for a
pair of units $i$ and $j$ it might happen that $Y_i(1) - Y_i(0) = 4 - 3 = Y_j(1) - Y_j(0) = 2 - 1$.
Thus both of these differences would contribute the same information to any function $f$ that
only depends on the difference. Similarly the difference in the medians under treatment
and under control, $\mathrm{median}[Y(1)] - \mathrm{median}[Y(0)]$, obscures the scaling issue. As such, the
only one dimensional summary that is meaningful when the scale of the observations is not
identified is one that summarizes the difference between the marginal distributions of the
potential outcomes. Formally, let $d(\cdot, \cdot)$ be a metric on the space of probabilities and define
the estimand $d(P_0, P_1)$. This estimand is especially important when considering the sharp
null hypothesis of no effect, as we show in Section 3.
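
The contrast between a scale-dependent summary and the distance estimand can be checked numerically. The sketch below is only an illustration under assumed inputs: the two marginal distributions and the two sets of numeric labels are hypothetical, and total variation is used as one possible choice of the metric $d$. The difference in means changes when the labels are re-spaced without changing their order, while the total variation distance does not involve the labels at all.

```python
import numpy as np

def total_variation(p0, p1):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(np.asarray(p0) - np.asarray(p1)).sum()

# Hypothetical marginal distributions over three ordered categories.
P0 = np.array([0.30, 0.40, 0.30])
P1 = np.array([0.10, 0.35, 0.55])

# Two order-preserving numeric codings of the same categories.
labels_a = np.array([1.0, 2.0, 3.0])
labels_b = np.array([1.0, 2.0, 10.0])

mean_diff_a = labels_a @ P1 - labels_a @ P0   # "ATE" under coding a
mean_diff_b = labels_b @ P1 - labels_b @ P0   # "ATE" under coding b
tv = total_variation(P0, P1)                  # does not use the labels

print(mean_diff_a, mean_diff_b, tv)
```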


Multidimensional estimands. As one dimensional summaries are often insufficient for
providing adequate information about any causal effects between treatment and control
groups, multidimensional estimands are required. A two dimensional estimand
$(f_0[Y(0)], f_1[Y(1)])$, for instance, where the functions $f_0, f_1$ are independent of the scale of
the potential outcomes, provides information that might have been concealed by estimands
that considered the difference $f_1(\cdot) - f_0(\cdot)$. As argued above, while differences provide
overarching information about outcomes under control and treatment, they do not provide
any information on the amount of category change due to treatment. Another two-dimensional
estimand that provides a compact summary of this information is the most likely pair of
potential outcomes, $ml = \arg\max_{i,j} p_{ij}$. In general, this is a function of $P$, but in cases of
independence between the potential outcomes it becomes a low dimensional summary of $P_0$
and $P_1$.

In practice, we are interested in the effects of mechanisms by which treatment changes the
potential outcome under control. As such, it is natural to consider estimands that condition
on the level of the potential outcome under control. Here, we advocate for the use of a class of
causal estimands that involves the conditional probabilities of the potential outcomes. That
is, we might be interested in the $k$-dimensional summaries $M_{1i} = \mathrm{median}[Y(1) \mid Y(0) = i]$
or $M_{2i} = \mathrm{mode}[Y(1) \mid Y(0) = i]$. These are more detailed versions of the two dimensional
estimands described above. It is important to note that any function of the conditional
distribution of the potential outcomes can be used as a multidimensional estimand. Conditional
estimands provide a way of measuring the magnitude of the effect relative to treatment when
no numeric scale is available. Since these estimands depend on the joint distribution of the
potential outcomes, estimation requires modeling assumptions about the potential outcomes.
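
To make the conditional estimands concrete, the following sketch computes the conditional medians and modes from a known joint matrix. It is an illustration only: the joint matrix is hypothetical, the function name `conditional_summaries` is introduced here, and the median is taken (as an assumed convention) to be the lowest category whose conditional cumulative probability reaches one half. The sketch also assumes every control level has positive probability.

```python
import numpy as np

def conditional_summaries(P):
    """Conditional median and mode of Y(1) given Y(0) = i, for each level i.

    P[i, j] = Pr(Y(0) = i + 1, Y(1) = j + 1); every row must have positive mass.
    Categories are reported with labels 1..k.
    """
    P = np.asarray(P, dtype=float)
    cond = P / P.sum(axis=1, keepdims=True)   # row i: Pr(Y(1) = . | Y(0) = i)
    cdf = np.cumsum(cond, axis=1)
    # Assumed convention: lowest category with cumulative probability >= 1/2.
    medians = (cdf >= 0.5 - 1e-12).argmax(axis=1) + 1
    modes = cond.argmax(axis=1) + 1
    return medians, modes

# Hypothetical joint distribution (illustrative values only).
P = np.array([
    [0.10, 0.20, 0.00],
    [0.00, 0.15, 0.25],
    [0.00, 0.00, 0.30],
])
medians, modes = conditional_summaries(P)
print(medians, modes)
```

In practice the joint matrix is not observed, which is why estimating these summaries requires the modeling assumptions mentioned above.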

The example in Section 6 illustrates how conditional estimands provide a meaningful summary
of the effect of parental education on a child’s educational achievement, which cannot be
quantified by considering unconditional estimands.

Several other multidimensional estimands have been proposed in the econometrics literature.
Boes (2013) proposes multidimensional estimands that describe differences in the distributions
under the control and treatment (rather than differences in the potential outcomes
themselves). For example, Boes defines $\Delta_i^{PS} = P_1(i) - P_0(i)$ and
$\Delta_i^{SD} = \sum_{j < i} (P_1(j) - P_0(j))$.
Similar effects are discussed in Li and Tobias (2008) in the context of a specific Bayesian
model for ordered data. If one observes $\Delta_i^{PS} > 0$ for all $i$ or $\Delta_i^{SD} > 0$ for all $i$, it suggests that
the potential outcome under treatment is stochastically greater than the potential outcome
under control. However, if the $\Delta_i^{PS}$ are sometimes greater than 0 and sometimes less than 0,
this estimand appears to carry little information unless particular meaning is available for a
single level of $\Delta_i^{PS}$ or $\Delta_i^{SD}$. If the goal of the estimand is to capture the difference between the
distributions under control and treatment, the one-dimensional distance estimand proposed
above can do that with a scalar summary. These estimands differ significantly from the
estimands we proposed because we are interested in conditional statements that are meaningful
for each level $i$ individually, whereas the unconditional statements of Boes (2013) most
often must be presented for all levels $i$ simultaneously. Also note that interpretability of the
latter requires the signs of all the related effects to match, an empirical result that cannot
be guaranteed for any specific data set.

2.2 Estimands on the latent scale 

When a latent variable formulation of ordinal non-numeric potential outcomes is used, causal
estimands can be defined on the latent scale. Recall that we refer to the pair $(Z_i(0), Z_i(1))$
as the latent potential outcomes for unit $i$ whenever there is a function $g(\cdot)$ that maps
$Y_i(1) = g(Z_i(1))$ and $Y_i(0) = g(Z_i(0))$. If the function $g$ is fully identified then the continuous
latent potential outcomes can be used as de facto outcomes, and causal analyses can
leverage classical results from the literature about continuous potential outcomes (Rubin,
2005). In particular, in this case, the average treatment effect on latent potential outcomes,
$E[Z(1) - Z(0)]$, becomes a meaningful causal effect. However, the identifiability of $g$ is
key to estimands defined on the latent scale being meaningful. Most often we attempt to
infer $g$ from the data, which requires defining an explicit dependence of the map on the
two treatments, yielding functions $g_1$ and $g_0$ for the treated and control potential outcome
maps. These functions likely have both a location and a scale lack of identifiability, in the
sense that $\exists h_t$ such that $h_t(cZ_i(t) + b) = g_t(Z_i(t))$. This non-identifiability is critical for the
interpretability of causal effects on the latent scale. In particular, non-identifiability of the
scale leads to the following two latent ATEs, under two different scale assumptions, having
the same interpretation on the observed scale:

$E[Z(1) - Z(0)]$ and $E[5Z(1) - 5Z(0)]$.

The latter is five times the size of the former on the latent scale, which is undesirable.
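
The scale non-identifiability can be seen in a short simulation. The sketch below is an illustration under assumptions that are not part of the paper: the latent control outcomes are standard normal, the latent effect is an additive shift of 0.5, and $g$ is a threshold map with two hypothetical cut points. Rescaling the latent outcomes and the cut points by 5 leaves every observed outcome unchanged while multiplying the latent ATE by 5.

```python
import numpy as np

rng = np.random.default_rng(0)
cuts = np.array([-0.5, 0.5])   # hypothetical thresholds defining 3 categories

def g(z, cuts):
    """Threshold map from the latent scale to ordered categories 1..k."""
    return np.digitize(z, cuts) + 1

z0 = rng.normal(size=1000)     # assumed latent outcomes under control
z1 = z0 + 0.5                  # assumed additive latent treatment effect

# Observed outcomes under the original latent scale ...
y0, y1 = g(z0, cuts), g(z1, cuts)
# ... and under a latent scale multiplied by 5 (cut points rescaled too).
y0_s, y1_s = g(5 * z0, 5 * cuts), g(5 * z1, 5 * cuts)

same_observed = np.array_equal(y0, y0_s) and np.array_equal(y1, y1_s)
latent_ate = np.mean(z1 - z0)
latent_ate_scaled = np.mean(5 * z1 - 5 * z0)

print(same_observed, latent_ate, latent_ate_scaled)
```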

The good news is that when this lack of identifiability persists (and it is unlikely that
there is a situation where non-identifiability is eliminated in a non-artificial way) we can
still use the latent scale in order to define causal effects on the observed scale. For instance,
we can consider the estimand $\mathrm{median}[g_1(Z(1)) \mid g_0(Z(0)) = j]$ as an estimand for the conditional
median of the potential outcomes on the observed scale, under treatment, given a particular
level under control. While this might appear tautological, the explicit dependence on the
latent scale and the map $g_t$ is important as it is a statement about the science.


3 Hypothesis testing for ordinal causal effects 

In causal analyses a common goal is to conduct a Fisher exact test for a sharp null hypothesis.
In the classical setting, the sharp null hypothesis of constant treatment effect can be studied
in the same way as a null of no effect at all. When dealing with ordinal non-numeric data,
however, these two cases need to be analyzed separately. For the sharp null of no effect, we
construct a Fisher exact test as in the classical literature, while testing for constant non
zero effects requires a more precise definition of the hypothesis and leads to a permutation
test that is not exact as the null is composite.

3.1 Testing for no effect 

The null hypothesis of no individual level effect is the same for ordinal non-numeric data as
it is for numeric data. We define the sharp null of no effect as $H_0^{0} : Y_i(0) = Y_i(1)$. That
is, under this null, an individual’s potential outcomes are equal and so by observing one of
the potential outcomes, we observe the complete science table. This leads to the following
construction of a randomization distribution for the observed data $\{(Y_i^{obs}, W_i) : i \le m\}$,
where $W_i$ is the observed treatment status and $Y_i^{obs}$ is the observed outcome:

1. Let $\Pi$ be the collection of all permutations of the integers 1 to $m$.

2. For $\pi \in \Pi$ compute a test statistic $T_\pi$ based on $\{(Y_i^{obs}, W_{\pi(i)}) : i \le m\}$.

3. Calculate $\Pr(T > T_{ident})$ where $T$ is distributed according to the empirical distribution
of $T_\pi$.

In classical settings, the test statistic $T$ is frequently chosen to be the average treatment
effect. Since this is not meaningful for ordinal non-numeric outcomes, a different one dimensional
statistic must be chosen. In particular, it is reasonable to consider a test statistic such
as $T = d(P_1, P_0)$ for $d(\cdot, \cdot)$ some measure of distance such as total variation.
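
The randomization procedure with a total variation statistic can be sketched as follows. This is a minimal illustration rather than the implementation used for the analyses in this paper: it approximates the full collection of permutations by a random sample of them, the function names are introduced here, and the toy data are invented.

```python
import numpy as np

def marginal(y, k):
    """Empirical distribution of outcomes coded 1..k."""
    return np.bincount(y - 1, minlength=k) / len(y)

def tv_statistic(y, w, k):
    """Total variation distance between treated and control marginals."""
    return 0.5 * np.abs(marginal(y[w == 1], k) - marginal(y[w == 0], k)).sum()

def permutation_pvalue(y, w, k, n_perm=2000, seed=0):
    """Monte Carlo approximation to Pr(T > T_ident) under the sharp null."""
    rng = np.random.default_rng(seed)
    t_obs = tv_statistic(y, w, k)
    # Under the sharp null the outcomes are fixed; only the assignment varies.
    t_null = np.array([tv_statistic(y, rng.permutation(w), k)
                       for _ in range(n_perm)])
    return t_obs, np.mean(t_null > t_obs)

# Toy data (hypothetical): 50 control and 50 treated units, k = 3 categories.
y = np.array([1] * 30 + [2] * 20 + [2] * 20 + [3] * 30)
w = np.array([0] * 50 + [1] * 50)
t_obs, p_value = permutation_pvalue(y, w, k=3)
print(t_obs, p_value)
```

Permuting the assignment vector keeps the number of treated units fixed, which matches a completely randomized design; other designs would require sampling assignments from their own mechanism.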

Example: In Section 6 we analyze in detail data from the 1994 General Social Survey
on educational outcomes of children whose parental education was either below college or
above college (Smith et al., 2013). The distributions of the potential outcomes under control
and under treatment are presented in Figure 1. A visual inspection of the two distributions
suggests that they are different. We make this concrete by performing the test for the sharp
null of no effect. The test statistic we use is the total variation distance between the marginal
distributions of the potential outcomes. The range of the null distribution based on 10000
permutations of the treatment assignment is (0.174, 0.413) while the observed value of the
test statistic is 0.804, giving a p-value of 0.



[Figure 1: two panels, each with x-axis “education level” and categories <HS, HS, AS, BA, GRAD.]


Figure 1: Marginal distributions of outcomes under control and under treatment for 
educational outcome data from the 1994 General Social Survey. “<HS”=“less than 
high school”, “HS”=“high school diploma”, “AS”=“associates”, “BA”=“bachelor degree”, 
“GRAD”=“graduate degree”. 

3.2 Testing for non zero effects 

As previously discussed, since the responses we consider are ordinal non-numeric, the distance
between consecutive categories cannot be assumed equal. As such, the sharp null of “constant
effect” that is normally denoted by $H_0^{const} : Y_i(1) - Y_i(0) = c > 0$ is no longer meaningful as
it would immediately assign a numeric scale to the data. Because of the lack of scale, it is
tempting to phrase a constant effect null hypothesis conditionally as $H_0^{const} : Y_i(1) - Y_i(0) = c$
such that $c > 0$ only if $Y_i(0) = j$, meaning that the effect of treatment is a constant $c$ categories
for individuals with control potential outcome equal to $j$ and zero otherwise. But such a
conditional statement does not fully specify the null space. Without loss of generality, let
$c = 1$ and $j = 1$ (the lowest category); that is, under the null, the effect of treatment on
individuals in the lowest category, under control, is one category exactly. Unfortunately, if
we observe $Y_i^{obs} = 1$ for $W_i = 1$, it is not possible to construct the full science table since
it is unclear what to set $Y_i^{mis}$ to. This leads to the following definition of what is arguably the
simplest null hypothesis of non zero effect.

Definition 3.1. For a $k$ level ordinal outcome, fixing $j < k$, the simplest non zero effect
null hypothesis is given by

• For $Y_i(0) = j$, $Y_i(1) - Y_i(0) = 0$ with probability $1 - p$.

• For $Y_i(0) = j$, $Y_i(1) - Y_i(0) = 1$ with probability $p$.

• Marginal probabilities: $\Pr(Y_i(0) = j) = l_j$ and $\Pr(Y_i(0) = j + 1) = l_{j+1}$.

• For $Y_i(0) \ne j$, $Y_i(1) - Y_i(0) = 0$.

Note that the statement “with probability p” can refer to a superpopulation quantity, or 
simply to the finite population proportion of individuals whose science table contains the 
requisite potential outcomes. 

A scenario in which such a null hypothesis is meaningful comes up in follow-up studies,
where preliminary results suggest a nonzero effect for certain groups, but do not have any
information about other groups. This is the simplest null in the sense that one cannot
formulate a nonzero null for ordinal non-numeric data without specifying at least this many
conditionals and marginals or imposing a numeric scale. The above null can also be stated as
a condition on the potential outcome $Y_i(1)$ under treatment by replacing “For $Y_i(0) = j$ ...”
with “For $Y_i(1) = j + 1$ ...” and replacing $p$ with $1 - p$. The statement of the null in
Definition 3.1 also dictates the procedure by which one can test the null. Fixing $j$, $p$ and
$l_j, l_{j+1}$ we can complete the science table in the following way:
• If $W_i = 0$ and $Y_i^{obs} = j$ then $Y_i^{mis} = j + 1$ with probability $p$ or $Y_i^{mis} = j$ with
probability $1 - p$.

• If $W_i = 1$ and $Y_i^{obs} = j + 1$ then set $Y_i^{mis} = j$ with probability $q = p l_j / (p l_j + l_{j+1})$ or
$Y_i^{mis} = j + 1$ with probability $1 - q$.

• Else, let $Y_i^{mis} = Y_i^{obs}$.

A permutation test can then be performed using the complete science table. Because the
null is composite, multiple science tables must be computed and the distribution of the test
statistic constructed over all of them. It is important to note that the quantity of interest
in this setting is the conditional probability $p$, suggesting that during the testing procedure
the marginals are nuisance parameters that can either be integrated over, or potentially set
to the marginals of the observed data.
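
The completion of the science table under the null of Definition 3.1 can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the function name `complete_science_table` and the toy data are introduced here, outcomes are coded as integers, and a single call produces one random completion (the composite null requires repeating the draw many times).

```python
import numpy as np

def complete_science_table(y_obs, w, j, p, l_j, l_j1, rng):
    """One random completion of the science table under Definition 3.1.

    y_obs: observed outcomes (integer codes); w: 0/1 treatment indicators.
    Returns arrays (y0, y1) of imputed potential outcomes.
    """
    y_obs, w = np.asarray(y_obs), np.asarray(w)
    y_mis = y_obs.copy()                      # default: no effect

    # Control units observed at level j move up under treatment w.p. p.
    ctrl = (w == 0) & (y_obs == j)
    y_mis[ctrl] = np.where(rng.random(ctrl.sum()) < p, j + 1, j)

    # Treated units observed at j + 1 came from level j w.p. q.
    q = p * l_j / (p * l_j + l_j1)
    trt = (w == 1) & (y_obs == j + 1)
    y_mis[trt] = np.where(rng.random(trt.sum()) < q, j, j + 1)

    y0 = np.where(w == 0, y_obs, y_mis)
    y1 = np.where(w == 1, y_obs, y_mis)
    return y0, y1

# Toy data (hypothetical) with k = 3, j = 1 and nuisance marginals set to 1/3.
rng = np.random.default_rng(1)
y_obs = np.array([1, 1, 2, 2, 3, 1, 2, 3])
w = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y0, y1 = complete_science_table(y_obs, w, j=1, p=0.5, l_j=1/3, l_j1=1/3, rng=rng)
print(np.column_stack([y0, y1]))
```

Each completed table can then be passed to a permutation test, with the resulting test statistics pooled over the repeated completions.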

A more general null hypothesis that has an interpretation as a constant non zero effect can
be motivated by a latent variable formulation of ordinal data. Recall that, if the potential
outcomes have a latent representation, then for the continuous latent potential outcomes
$(Z_i(0), Z_i(1))$ one could test for a constant treatment effect of $Z_i(1) - Z_i(0) = c$.

Under certain conditions, an interpretation of this null is available in terms of the null
specified by Definition 3.1. The latent potential outcomes are mapped back to the observed
scale by the function $g$. Here the explicit dependence of $g$ on the treatment is suppressed since
we assume an additive treatment effect, that is, $g_1(\cdot) = g_0(\cdot - c)$. Let $c > 0$ be such that for all
levels $j$ and for all latent values $z$ for which $g(z) = j$ we have $g(z + c) \le j + 1$, with equality
for some $z$. That is, for $c$ small enough, a constant latent effect is interpretable as at most
improvement by one category on the observed scale for all individuals but with different
probabilities. That is, we would have a combination of the simple nulls of Definition 3.1
where for each level $j$, the conditional probability that $Y_i(1) - Y_i(0) = 0$ for $Y_i(0) = j$ is
given by $p^{(j)} = \Pr(Z_i(0) + c \in \mathcal{Z}_j \mid Z_i(0) \in \mathcal{Z}_j)$, where $\mathcal{Z}_j = \{z : g(z) = j\}$.
Table 1: Joint distribution of the potential outcomes.
For a general $c$, the interpretation of the test on the observed scale becomes more complicated
but involves a similar structure to the one in Definition 3.1: WLOG let $c > 0$ be
as before, but for a single category $j$, for all $z$ with $g(z) = j$ we have $g(z + c) \le j + 2$ with
equality for some $z$. As such, the improvement is at most two levels for individuals for whom
the potential outcome under control is $j$. A null hypothesis that corresponds to
this on the observed scale requires defining probabilities $p_0, p_1, p_2$ for $Y_i(1) - Y_i(0) \in \{0, 1, 2\}$
when $Y_i(0) = j$. In particular, $p_0^{(j)} = \Pr(Z_i(0) + c \in \mathcal{Z}_j \mid Z_i(0) \in \mathcal{Z}_j)$,
$p_1^{(j)} = \Pr(Z_i(0) + c \in \mathcal{Z}_{j+1} \mid Z_i(0) \in \mathcal{Z}_j)$, and $p_2^{(j)} = 1 - p_0^{(j)} - p_1^{(j)}$.

3.3 Example and fiducial type intervals 

In this section, we consider a (simulated) randomized experiment with 500 units and an
ordinal non-numeric outcome labeled {1, 2, 3}, where label 1 indicated control. The joint
distribution of the potential outcomes for the simulated data is given by Table 1. This type
of joint distribution represents information that the treatment leads to at most a change of plus
one category with possibly different probabilities of change conditional on the potential outcome
under control. The estimands of interest are the conditional probabilities $q_1, q_2$, and it is
easy to see that they both fit into the paradigm of Proposition 4.1 of the next section and
so they can be estimated without specifying an additional model for the data. For example,
$1 - q_1 = \mathrm{mean}(Y(1) = 1) / \mathrm{mean}(Y(0) = 1)$, where each of the means is an entry in the
empirical distribution of the marginals. The estimates are truncated at 0 and 1 to get valid
probabilities, as is commonly done for correcting method of moments estimators, though
this correction is rarely necessary in the large sample setting we consider.
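
The moment estimators can be sketched in code. The sketch rests on an assumption about the structure of the joint distribution, based on the description above rather than on the table itself: units at level 1 or 2 under control move up exactly one category with probabilities $q_1$ and $q_2$, and units at level 3 do not move. Under that assumption $\Pr(Y(1) = 1) = \Pr(Y(0) = 1)(1 - q_1)$ and $\Pr(Y(1) = 3) = \Pr(Y(0) = 3) + \Pr(Y(0) = 2) q_2$. The function name `estimate_q` is introduced here, and a larger sample than the 500 units of the example is simulated so that the estimates are visibly close to the truth.

```python
import numpy as np

def estimate_q(y, w):
    """Method of moments estimates of (q1, q2), truncated to [0, 1].

    Assumes three categories, moves of at most +1, and no move from level 3.
    Requires control units to be observed at levels 1 and 2.
    """
    y, w = np.asarray(y), np.asarray(w)
    p0 = np.array([np.mean(y[w == 0] == c) for c in (1, 2, 3)])  # control marginal
    p1 = np.array([np.mean(y[w == 1] == c) for c in (1, 2, 3)])  # treated marginal
    q1 = 1.0 - p1[0] / p0[0]
    q2 = (p1[2] - p0[2]) / p0[1]
    return float(np.clip(q1, 0, 1)), float(np.clip(q2, 0, 1))

# Simulated data under the assumed structure (illustrative only).
rng = np.random.default_rng(0)
n = 200_000
y0 = rng.integers(1, 4, size=n)               # uniform marginals on {1, 2, 3}
q_true = np.array([0.7, 2 / 3, 0.0])          # level 3 cannot move up
y1 = y0 + (rng.random(n) < q_true[y0 - 1])    # move up one category or stay
w = rng.integers(0, 2, size=n)                # randomized assignment
y = np.where(w == 1, y1, y0)                  # observed outcomes

q1_hat, q2_hat = estimate_q(y, w)
print(q1_hat, q2_hat)
```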
The null hypothesis described in the previous section requires specifying values
$(\eta_1, \eta_2) \in (0, 1)^2$ that correspond to the conditional probability of a positive one category
change when the potential outcome under control is one and two, respectively, and parameters
$(\nu_1, \nu_2, \nu_3)$ in the simplex that correspond to the marginals. Let $p(\eta_i)$ be the p-value for the
conditional probability $q_i$ in Table 1. Performing the test for multiple values of $\eta_i$ and $\nu_i$ recovers
different p-values. This suggests that we can construct $100(1 - \alpha)\%$ fiducial type intervals
for the conditional effects $q_i$ with lower bound $\sup\{\eta_i : p(\eta_i) < \alpha/2\}$ and upper bound
$\inf\{\eta_i : p(\eta_i) > 1 - \alpha/2\}$ (e.g., see Wang, 2000; Dasgupta et al., 2014). Since the
nuisance parameters $(\nu_1, \nu_2, \nu_3)$ are not known, we must also consider them in the sequence
of nulls. As such, by projecting the p-values down to the space of $(q_1, q_2)$ the intervals
we recover are conservative. This procedure is superior to using a plug-in estimate for the
nuisance parameters as the intervals using that procedure can be either conservative or
anti-conservative (Imbens and Rubin, 2015).
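
Given p-values computed on a grid of null values, the interval bounds can be read off directly. The sketch below is an illustration only: the grid and the p-values are hypothetical, the function name `fiducial_interval` is introduced here, and the fallback to the edge of the grid when no p-value crosses a threshold is a choice made for this sketch. It does not perform the projection over the nuisance parameters.

```python
import numpy as np

def fiducial_interval(etas, pvals, alpha=0.05):
    """Bounds sup{eta : p(eta) < alpha/2} and inf{eta : p(eta) > 1 - alpha/2}.

    etas, pvals: null values on a grid and their p-values.
    Falls back to the grid edges when a threshold is never crossed.
    """
    etas = np.asarray(etas, dtype=float)
    pvals = np.asarray(pvals, dtype=float)
    below = etas[pvals < alpha / 2]
    above = etas[pvals > 1 - alpha / 2]
    lower = below.max() if below.size else etas.min()
    upper = above.min() if above.size else etas.max()
    return lower, upper

# Hypothetical grid and p-values (invented for illustration).
etas = np.linspace(0.1, 0.9, 9)
pvals = np.array([0.0, 0.001, 0.01, 0.2, 0.5, 0.8, 0.97, 0.99, 1.0])
lower, upper = fiducial_interval(etas, pvals)
print(lower, upper)
```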

Here we simulate $N = 500$ experimental units from the joint distribution in Table 1 with
parameters $(\nu_1, \nu_2, \nu_3) = (1, 1, 1)/3$ and $(q_1, q_2) = (7/10, 2/3)$. From the observed outcomes,
the estimates are $\hat{q}_1 = 0.73$ and $\hat{q}_2 = 0.61$. We perform the test for null hypotheses where
$(\eta_1, \eta_2)$ are on a uniform $30 \times 30$ grid in $[0.1, 0.999]^2$ and $(\nu_1, \nu_2, \nu_3)$ are sampled uniformly
on the simplex, but restricted to be no bigger than 0.6 and no smaller than 0.15 each. For
each null, we perform the test as described in Definition 3.1, constructing 1000 null science
tables and getting 100 randomizations from each one.

Figure 2 provides the null densities for a particular null for $q_1$ and $q_2$. The dashed vertical
lines are the observed values, with p-values 0 and 0.62, respectively. Figure 3 provides the
p-values for the $q_i$ as a function of the $\eta_i$. The vertical lines provide the two 95% fiducial
type intervals: [0.63, 0.83] and [0.56, 0.83], respectively.
Figure 2: Randomization distributions for $q_1$ (left panel) and $q_2$ (right panel) for parameter
values $(\eta_1, \eta_2) = (0.487, 0.624)$ and $(\nu_1, \nu_2, \nu_3) = (0.280, 0.549, 0.171)$.
Figure 3: p-values as functions of parameters $\eta_1$ and $\eta_2$ with the dashed lines representing
95% conservative fiducial intervals and the red lines representing the observed data estimates.
4 Inference and approaches to identification of causal effects

Estimation of causal effects has been discussed in detail for many types of data (Rubin, 1978, 
1974, 2005), but no explicit discussion is available in the statistics literature for ordinal 
non-numeric data. Throughout we will assume the standard assumptions of the Rubin 
Causal Model, such as the stable unit treatment value assumption (SUTVA), ignorability of 
the assignment mechanism that leads to the realized outcomes on the observed scale, and 
(whenever relevant) of the assignment mechanism that leads to the realized outcomes on the 
latent scale. In such situations, estimands that are meaningful in both ordinal non-numeric 
and continuous outcome settings can be estimated using existing machinery (Imbens and 
Rubin, 2015). In particular, when operating on the observed scale without assuming a model 
for the potential outcomes, we can estimate any estimand that can be written explicitly as 
a difference between a function of the distribution of the responses under treatment and the 
distribution of the responses under control. In the classical setting of continuous outcomes 
this translates to the average treatment effect being estimable since expectations are linear 
(and so $E[Y(1) - Y(0)] = E[Y(1)] - E[Y(0)]$).

For ordinal non-numeric data, however, an estimand that directly compares outcomes 
under control and under treatment without including a model for the science does not exist. 

Proposition 4.1. For ordinal non-numeric data, an estimand that cannot be written as a
combination of separated functions such as

$$f(Y(0), Y(1)) = f_1(Y(1)) - f_0(Y(0))$$

cannot be estimated without a model for the science. For any estimand that can be separated
as above, $f_1$ is a function of the distribution of $Y(1)$ alone and $f_0$ is a function of the
distribution of $Y(0)$ alone. If there exist unbiased estimates of $f_0$ and $f_1$, then there is an
unbiased estimate of the estimand of interest.


The proof of Proposition 4.1 is a direct consequence of classical results about estimation 
of causal effects (Imbens and Rubin, 2015). Intuitively, if an estimand cannot be written as 
a combination of functions that only depend on the marginal distributions of the potential 
outcomes then it necessarily depends on their joint distribution. Estimation in this case 
requires a model for the science. If an estimand can be separated as such and if unbiased 
estimates exist of each part, then any combination of unbiased estimates remains unbiased. 
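
To make this intuition concrete, the following minimal sketch builds two hypothetical joint distributions over three levels (coded 0, 1, 2 purely for array indexing) that share the same marginals but imply different conditional medians. The helper `conditional_median` and both joint tables are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def conditional_median(joint, j):
    """Median of Y(1) given Y(0) = j; rows index Y(0), columns index Y(1)."""
    row = joint[j] / joint[j].sum()
    # Smallest level whose cumulative conditional probability reaches 1/2.
    return int(np.searchsorted(np.cumsum(row), 0.5))

# Two hypothetical joints with identical (uniform) marginals.
independent = np.full((3, 3), 1.0 / 9.0)  # Y(0) and Y(1) independent
comonotone = np.eye(3) / 3.0              # Y(1) = Y(0) with probability one

medians_independent = [conditional_median(independent, j) for j in range(3)]
medians_comonotone = [conditional_median(comonotone, j) for j in range(3)]
print(medians_independent, medians_comonotone)
```

Since the two tables agree on both marginals yet disagree on the conditional medians, no function of the marginals alone can recover this estimand.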

Proposition 4.1 unsurprisingly states that “simple” estimands can be estimated as they
previously have been. However, it quickly emerges that for ordinal non-numeric data the
estimands of greatest interest do not satisfy the separation of Proposition 4.1. For example,
the conditional medians $\mathrm{median}[Y(1) \mid Y(0) = j]$ described in Section 2 do not satisfy the
condition. To estimate these effects we require additional assumptions about the science.
The first assumption we consider is that of monotone treatment effects, which assists in
identifying the joint distribution of the potential outcomes. While it is appealing, this
assumption does not fully resolve the issues with estimating the estimands of interest. As
such, we develop estimation under the latent variable assumption, explicitly describing the
functional relationship between the potential outcomes on the observed and latent scales.

4.1 Monotone treatment effect 

Formally, the assumption of a monotone treatment effect states that the potential outcome
under treatment is at least as large as that under control. That is, $Y_i(1) \geq Y_i(0)$ for all $i$.

We first consider the case of a binary treatment and ordered binary outcome. If we can 
assume monotonicity of treatment then we are able to recover the full joint distribution 
from the two marginals and hence any of the estimands previously listed. Monotonicity 
of treatment gives us that $\Pr(Y(0) = 1, Y(1) = 0) = 0$. This immediately implies that
$\Pr(Y(0) = 0, Y(1) = 0) = \Pr(Y(1) = 0)$. We can just as easily recover the other elements of
the joint distribution:

$$\Pr(Y(0) = 1, Y(1) = 1) = \Pr(Y(1) = 1 \mid Y(0) = 1) \cdot \Pr(Y(0) = 1) = \Pr(Y(0) = 1)$$

where the second equality is due to the conditional probability equaling 1 (by monotonicity).
The final probability can be recovered via the additivity of probabilities to get
$\Pr(Y(0) = 0, Y(1) = 1) = 1 - \Pr(Y(1) = 0) - \Pr(Y(0) = 1)$.
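
The binary-outcome argument can be sketched in a few lines. The function name `joint_under_monotonicity` and the marginal values 0.3 and 0.7 are hypothetical and chosen only for illustration; keys of the returned dictionary are pairs (Y(0), Y(1)).

```python
def joint_under_monotonicity(p0_one, p1_one):
    """Joint of (Y(0), Y(1)) for binary outcomes, assuming Y(1) >= Y(0).

    p0_one = Pr(Y(0) = 1) and p1_one = Pr(Y(1) = 1) are the two marginals.
    """
    if p1_one < p0_one:
        raise ValueError("marginals are incompatible with monotonicity")
    return {
        (0, 0): 1.0 - p1_one,     # equals Pr(Y(1) = 0)
        (1, 0): 0.0,              # ruled out by monotonicity
        (1, 1): p0_one,           # equals Pr(Y(0) = 1)
        (0, 1): p1_one - p0_one,  # 1 - Pr(Y(1) = 0) - Pr(Y(0) = 1)
    }

joint = joint_under_monotonicity(0.3, 0.7)  # hypothetical marginals
```

In practice the two marginals would be replaced by their estimates from the control and treated groups.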

In cases where the outcome has more than two levels, monotonicity is not sufficient 
for fully identifying the joint probability distribution of the potential outcomes. However, 
certain estimands of interest can be bounded by estimands that do not require knowledge of 
the full joint. 

Proposition 4.2. For $k$ level ordinal non-numeric outcomes under the assumption of a
monotone treatment effect, we have

1. $j \leq \mathrm{median}[Y(1) \mid Y(0) = j]$

2. $j \leq \mathrm{median}[Y(1) \mid Y(0) \geq j]$

3. $\mathrm{median}[Y(0) \mid Y(0) \leq j] \leq \mathrm{median}[Y(1) \mid Y(0) \leq j]$

Proof. Monotonicity states that $\Pr(Y(1) \geq Y(0)) = 1$, which implies
$\Pr(Y(1) \geq j \mid Y(0) = j) = 1$ for all $j$. This proves 1 and 2. Inequality 3 follows from the
fact that the truncated distribution $Y(1) \mid Y(0) \leq j$ is necessarily shifted to the right of the
truncated distribution $Y(0) \mid Y(0) \leq j$. □
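
As a numerical sanity check of these bounds (not a substitute for the proof), the sketch below draws a hypothetical joint distribution supported on $Y(1) \geq Y(0)$ and verifies the three inequalities for every level. The median is taken to be the smallest level whose cumulative probability reaches one half, which is an assumed convention; the random joint and the seed are arbitrary.

```python
import numpy as np

def median_level(probs):
    """Smallest level (1..len(probs)) whose cumulative probability reaches 1/2."""
    cdf = np.cumsum(np.asarray(probs, dtype=float))
    return int(np.searchsorted(cdf, 0.5 * cdf[-1])) + 1

rng = np.random.default_rng(0)
k = 5
# Rows index Y(0), columns index Y(1); the upper-triangular support
# encodes monotonicity, Pr(Y(1) >= Y(0)) = 1.
joint = np.triu(rng.random((k, k)))
joint /= joint.sum()

bounds_hold = True
for j in range(1, k + 1):
    at_j = joint[j - 1]                # Y(1) given Y(0) = j (unnormalized)
    above = joint[j - 1:].sum(axis=0)  # Y(1) given Y(0) >= j
    below = joint[:j]                  # units with Y(0) <= j
    bounds_hold &= j <= median_level(at_j)
    bounds_hold &= j <= median_level(above)
    bounds_hold &= median_level(below.sum(axis=1)) <= median_level(below.sum(axis=0))
print(bounds_hold)
```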

Similar bounds can be derived for statements about other conditional functions such as 
the mode or other quantiles. This assumption is only appropriate when the non-negativity 
or non-positivity of the treatment effect is known a priori. For example, it is reasonable
to assume that a reading comprehension program can only improve reading skills, but it 
might be inappropriate to assume that attending a classical music concert cannot have both 
positive and negative effects on a person’s state of mind. 

4.2 Latent variable formulation 

We have previously introduced the latent variable formulation for ordinal non-numeric
potential outcomes. Intuitively, this approach makes an assumption about the underlying science.
This is because we require an explicit functional relationship between the potential outcomes
proper, $(Y_i(0), Y_i(1))$, and the latent potential outcomes, $(Z_i(0), Z_i(1))$. We write this map
as a pair of functions $g_t$ for $t \in \{0, 1\}$ that map $Z_i(t) \mapsto Y_i(t)$. We make the dependence on
the treatment explicit since estimation cannot proceed without identifying what the form of
treatment is on the latent scale. The only functional restriction on the two $g_t$ functions is
that $y = g_t(z) > g_t(z') = y'$ implies that $z > z'$. This general formulation is not a model for
estimation but rather a fundamental assumption about the science.

For estimation to work, we must choose an explicit functional form for the treatment
effect. A natural choice here is a linear treatment effect, which simplifies the notation as
we have $g(Z(1)) = g(Z(0) + c)$. A possible nonlinear functional form for the treatment
effect makes the assumption that $g_0$ and $g_1$ have different cutoff values for mapping between
the latent and observed scale. Whatever the assumption made about $g$, the procedure for
inference is as follows.

1. Choose the functional form for $g_0$ and $g_1$.

2. Write explicitly $Y_i^{obs} = g_{W_i}(Z_i^{obs})$ and let $Z_i^{obs} = f(X_i)$ where $W_i$ is the assigned
treatment to unit $i$. In the language of generalized linear models $g$ is our link function
and $f$ the mean function that is there for notational purposes to explicitly state the
functional dependence of the latent variable on additional covariates.

3. Estimate $g_0$, $g_1$ and $f$ using the observed data.

4. Impute the missing potential outcomes $Y_i^{mis}$ by writing $Z_i^{mis} = f(X_i)$ and let
$Y_i^{mis} = g_{1 - W_i}(Z_i^{mis})$.

5. Estimate the estimand on the observed scale using the imputed data $(Y^{obs}, Y^{mis})$.

The procedure described above takes advantage of the continuity of the latent scale to
make inference about the observed scale tractable. When we consider a linear effect of
treatment on the latent scale, we can write in steps 2 and 4 above: $Z_i^{obs} = f(X_i, W_i)$ and
$Z_i^{mis} = f(X_i, 1 - W_i)$, and drop the dependence of $g$ on the treatment assignment. One of the
biggest advantages of employing the latent scale formulation is the ability to estimate the
uncertainty about the observed scale estimands. To do so we must compute the variance of the
predictive sampling distribution of the observed scale potential outcomes. While this can be done
explicitly as outlined above, the desired quantities can be conceptualized as a consequence
of a Bayesian estimation approach in which the estimands on the observed scale are functions
of the posterior predictive distribution. We outline such a Bayesian estimation procedure
next, and we discuss the selection of priors.


5 Bayesian formulation 

Bayesian estimation procedures play an important role in causal analysis (Rubin, 1978). We 
consider a single treatment and single control group, with k ordinal non-numeric outcomes. 
For each unit i we observe a potential outcome associated with its treatment assignment 
as well as covariates that provide background information and will be used to adjust the 
outcomes. As in Section 4, we write $Z_i(t)$ for the latent scale potential outcome for treatment
$t$, and we need to estimate the functions $g_t$ and $f$. The potential outcomes on the observed
scale, $Y_i(t)$, take on values in $\{1, \ldots, k\}$. There are $n$ units.

Formally, we can write the assumed model as follows.

$$Y_i(t) = g_t(Z_i(t))$$
$$Z_i(t) = f(X_i, W_i) + \epsilon_i(t)$$
$$\epsilon_i(t) \overset{iid}{\sim} \pi$$

where $\pi$ is the distribution of the errors that has no unknown parameters (this is an
identification assumption and the reason why latent scale estimands are frequently
inappropriate), and where $Y_i(t) = g_t(Z_i(t)) = j$ if $Z_i(t) \in (s^t_{j-1}, s^t_j]$ for a monotonic
increasing sequence $S^t = \{-\infty = s^t_0, s^t_1, \ldots, s^t_k = \infty\}$. For example, if $g_t = g$,
$f(X_i, W_i) = X_i \beta + W_i \beta_w$, $S^1 = S^0$ and $\pi$ is a standard normal distribution, then this
is a standard ordered probit model. If $\pi$ is the logistic distribution then this is a standard
ordered logit, and so on. One can also consider a scenario where $\beta_w = 0$ but $g_1 \neq g_0$ and
$S^1 \neq S^0$; that is, no linear effect of treatment on the latent variables, but possibly different
cutoff values for the treated and control groups. All of these options are estimable using a
Bayesian approach as long as the distribution of the missing potential outcomes conditional on
the observed ones is tractable. Below we outline a standard Gibbs sampler scheme for the
ordered probit model.
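
To fix ideas, the ordered probit special case can be simulated as follows. This is a sketch under stated assumptions: the covariates, coefficient values, cutoffs, and the independence of the errors across the two arms are all hypothetical choices made for illustration, and the helper `g` assigns level $j$ when the latent value falls between consecutive cutoffs.

```python
import numpy as np

def g(z, cutoffs):
    """Map latent values to levels 1..k; level j iff s_{j-1} < z <= s_j."""
    return np.searchsorted(cutoffs, z, side="left") + 1

rng = np.random.default_rng(1)
n = 1000
X = rng.normal(size=(n, 2))           # hypothetical covariates
beta = np.array([0.5, -0.25])         # hypothetical coefficients
beta_w = 0.8                          # hypothetical linear treatment effect
cutoffs = np.array([-1.0, 0.0, 1.0])  # interior cutoffs s_1 < s_2 < s_3, so k = 4

# pi is standard normal; errors are drawn independently across arms (an assumption).
eps0, eps1 = rng.normal(size=n), rng.normal(size=n)
Z0 = X @ beta + eps0                  # latent outcome under control
Z1 = X @ beta + beta_w + eps1         # latent outcome under treatment
Y0, Y1 = g(Z0, cutoffs), g(Z1, cutoffs)

W = rng.integers(0, 2, size=n)        # randomized assignment (assumed)
Y_obs = np.where(W == 1, Y1, Y0)      # only one potential outcome is observed
```

Replacing the standard normal draws with logistic draws would give the ordered logit variant of the same simulation.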

5.1 Prior choice for the ordered probit 

The first step in Bayesian inference is prior choice. We require priors for the parameters
$(\beta, \beta_w)$, as well as for the cutoff values $S$. In a slight abuse of notation, we will write $X_i$ for
the combined vector of pre-treatment variables with the indicator for treatment and we will
write $\beta$ for the combined vector of coefficients. Since the latent errors are standard Gaussian,
a natural conjugate prior for the parameters $\beta$ is also Gaussian, for example, with mean 0
and covariance matrix $n(X^\top X)^{-1}$. The more complicated prior choice is for the cutoff values
$S$. Note that, if we knew the cutoff values exactly, we would in fact have the proper scaling
for doing inference on the latent scale.

Even though the cutoff parameters are critical for identifying the model, a default prior 
for the cutoff values S does not exist. We note that a prior for the cutoff values does not 
have to respect the ordering of the elements of S since the ordering will be imposed by the 
likelihood function. As such, without scientific knowledge, a reasonable prior that carries
very little information is essentially flat. A potential choice is a product of mean zero normal 
variables with a large variance parameter. 

Because of the difficulty of choosing a reasonable prior for S and since we are not
interested in interpreting the latent scale of the potential outcomes, we argue it is more
meaningful to consider the rank likelihood as the data likelihood as discussed below.

5.2 Posterior inference for the ordered probit 

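For the probit model with independent priors for thresholds $S$ and a $N(0, n(X^\top X)^{-1})$ prior
for the $\beta$s the full conditional distributions are easy to derive and so we can formulate a
Gibbs sampler with the following steps.

1. Initialize $\beta_{[0]}, Z_{[0]}, S_{[0]}$ where, in an abuse of notation, the bracketed subscript refers to
the iteration of the Markov chain Monte Carlo algorithm.

2. Sample $\beta_{[l]} \mid Z_{[l-1]}$ from its Gaussian full conditional.

3. For each unit $i$ sample $Z_{[l]i} \sim N(X_i \beta_{[l]}, 1)$, truncated to the interval between the
current cutoffs that corresponds to $Y_i$.

4. For each cutoff point $j$ sample
$s_{[l]j} \mid Y, Z_{[l]} \sim p(s_{[l]j})\, \delta_{\max\{Z_{[l]i} \mid Y_i = j\},\, \min\{Z_{[l]i} \mid Y_i = j+1\}}(s_{[l]j})$.

5. Sample $Y^{mis}_{[l]}$ given $\beta_{[l]}, S_{[l]}$ from the posterior predictive distribution and construct the
causal estimand of interest $\tau_{[l]}$.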
Above, the function $\delta_{a,b}(c)$ is equal to one if $a < c < b$. Steps 2 through 5 are iterated
until the joint distribution of $(\beta, Z, S)$ reaches stationarity. The distribution of the $\tau_{[l]}$ for
$l \in \{1, \ldots, L\}$ provides an approximation for the posterior predictive distribution of the
causal estimand of interest on the observed scale. A histogram summary of the posterior
predictive distribution provides both a measurement of the most likely value of the estimand
and a measure of the certainty about this value.
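
A compact sketch of steps 1 through 4 is given below; step 5 is omitted because it depends on the estimand. Several choices here are assumptions rather than part of the text: the cutoff prior is taken to be independent $N(0, \texttt{prior\_sd}^2)$, the $\beta$ update uses the Gaussian full conditional implied by the $N(0, n(X^\top X)^{-1})$ prior with unit error variance (mean and covariance shrunk by $n/(n+1)$), the design matrix has no intercept column, and every level of the outcome is observed at least once. The function name `gibbs_ordered_probit` and the simulated data are hypothetical.

```python
import numpy as np
from scipy.stats import truncnorm

def gibbs_ordered_probit(X, Y, k, n_iter=200, prior_sd=10.0, seed=0):
    """Gibbs sampler sketch; Y is an integer array in {1..k}, all levels observed."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    shrink = n / (n + 1.0)  # implied by the N(0, n (X'X)^{-1}) prior
    XtX_inv = np.linalg.inv(X.T @ X)
    chol = np.linalg.cholesky(shrink * XtX_inv)
    # Step 1: initialize cutoffs s_0 = -inf < s_1 < ... < s_{k-1} < s_k = inf, and Z.
    s = np.concatenate(([-np.inf], np.arange(1, k) - k / 2.0, [np.inf]))
    Z = Y - Y.mean()
    beta_draws, s_draws = np.empty((n_iter, p)), np.empty((n_iter, k - 1))
    for l in range(n_iter):
        # Step 2: beta | Z is Gaussian by conjugacy.
        beta = shrink * XtX_inv @ X.T @ Z + chol @ rng.normal(size=p)
        # Step 3: Z_i is N(X_i beta, 1) truncated to the cutoffs bracketing Y_i.
        mean = X @ beta
        Z = truncnorm.rvs(s[Y - 1] - mean, s[Y] - mean, loc=mean,
                          size=n, random_state=rng)
        # Step 4: each cutoff is drawn from its prior, truncated between neighbours.
        for j in range(1, k):
            lo, hi = Z[Y == j].max(), Z[Y == j + 1].min()
            s[j] = truncnorm.rvs(lo / prior_sd, hi / prior_sd,
                                 scale=prior_sd, random_state=rng)
        beta_draws[l], s_draws[l] = beta, s[1:k]
    return beta_draws, s_draws

# Usage on simulated data with k = 3 levels (hypothetical values).
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
Z_true = X @ np.array([1.0, -0.5]) + rng.normal(size=300)
Y = np.searchsorted([-0.5, 0.5], Z_true) + 1
beta_draws, s_draws = gibbs_ordered_probit(X, Y, k=3)
```

Note that `truncnorm` expects its bounds on the standardized scale, which is why the cutoffs are shifted by the mean (and divided by `prior_sd` in step 4). Cutoff updates of this kind are known to mix slowly, so a practical run would use many more iterations and discard a burn-in period.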

If we choose to avoid the prior specification for $S$, we can employ the rank likelihood, first
discussed by Pettitt (1982) and employed in the ordered probit setting by Hoff (2008). In the
rank likelihood case, we do not need to estimate the cutoff values in $S$. Instead we require
that the latent outcomes $Z$ lie in the set $\{Z : Z_i < Z_{i'} \text{ whenever } Y_i < Y_{i'}\}$.

Posterior inference with this assumption forgoes step 4 of the procedure above. The full
conditional for the parameters $\beta$ remains the same and so only step 3 must be changed to
reflect the rank likelihood, as the full conditional distribution of $Z_{[l]i}$ now depends on the
remaining $Z_{[l]-i}$. That is, the sampling distribution in step 3 becomes

$$Z_{[l]i} \sim N(X_i \beta_{[l]}, 1)\, \delta_{\max\{Z_{i'} : Y_{i'} < Y_i\},\, \min\{Z_{i'} : Y_{i'} > Y_i\}}(Z_{[l]i}).$$
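
The modified step 3 under the rank likelihood can be sketched as a single sweep over the units, in which each latent value is redrawn from a unit-variance normal truncated so that the ordering implied by the observed outcomes is preserved. The function name `update_Z_rank` is hypothetical, the bounds follow the set constraint described above, and the starting values are assumed to respect the ordering already (for instance, `Z = Y` cast to floats).

```python
import numpy as np
from scipy.stats import truncnorm

def update_Z_rank(Z, Y, mean, rng):
    """One sweep of step 3 under the rank likelihood; mean[i] plays the role of X_i beta."""
    Z = Z.copy()
    for i in range(len(Z)):
        below, above = Z[Y < Y[i]], Z[Y > Y[i]]
        lo = below.max() if below.size else -np.inf  # largest Z among lower levels
        hi = above.min() if above.size else np.inf   # smallest Z among higher levels
        # truncnorm takes standardized bounds; the scale is 1 here.
        Z[i] = truncnorm.rvs(lo - mean[i], hi - mean[i], loc=mean[i], random_state=rng)
    return Z

rng = np.random.default_rng(4)
Y = np.repeat([1, 2, 3], 10)   # hypothetical outcomes
Z_init = Y.astype(float)       # starting values that respect the ordering
Z_new = update_Z_rank(Z_init, Y, np.zeros(len(Y)), rng)
```

Because each draw uses the most recent values of the other units, the ordering constraint holds after every individual update, not only at the end of the sweep.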

Complications. In applications it may be desirable to assume that $g_0 \neq g_1$ and that
$S^0 \neq S^1$. Both of the approaches above can be employed to perform inference under
this more complex model. Assuming that the only difference between the two functions $g_t$
is in the cutoff values (that is, the mean function $f$ remains the same, with or without the
additive treatment effect), the MCMC procedure described above does not change substantially.
Specifically, step 2 remains the same where $Z_{[l-1]}$ includes all of the units. The main changes
appear in steps 3 and 5 (and step 4 if it is needed): each of these steps is split into an update
for the control and an update for the treated groups since these groups now have their own
parameters. Further variations on the model can also be introduced as long as all of the
requisite probability distributions can either be computed or sampled from.
Table 2: The distributions of individual educational outcomes according to treatment status.

            “less than high school”   “high school”   “associates”   “bachelor”   “graduate”
Control               79                   378              52           112           49
Treatment              2                    46              11            65           41



6 Analysis of educational outcomes in the General Social Survey


In this section we consider data from the 1994 General Social Survey on the educational
outcomes of a sample of individuals living in the United States (Smith et al., 2013). Each of
the 835 male respondents who were between the ages of 25 and 60 and in the workforce
during the survey provided information on their educational outcomes as well as information
about whether at least one of their parents had attained a college degree or higher. The
possible levels of education recorded for an individual were “less than high school”, “high
school”, “associates”, “bachelor” and “graduate”. These data are presented in Table 2 and
Figure 1 and were previously studied in Hoff (2007, 2009).

Based on the marginal distributions alone one might suspect that a positive treatment
effect exists, as over 50% of individuals in the treated group achieved an educational level
of college or above, while only 24% of the control group have this educational level. More
generally, the marginal distribution of the treatment group stochastically dominates the
marginal distribution of the control group. However, this information does not demonstrate
the magnitude of the effect. As such we consider the conditional estimands that we
previously described. Here we are interested in the conditional median potential outcome under
treatment given a particular level of the potential outcome under control. This estimand
can capture different magnitudes of the effect for individuals whose potential outcomes differ
under control. For example, we might expect individuals whose potential outcome under
control is college or better to experience a smaller effect of treatment than individuals whose
potential outcome under control is less than college.
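
The marginal comparisons quoted above can be reproduced directly from the counts in Table 2. The sketch below computes the share of each group with a bachelor's degree or higher and checks first-order stochastic dominance by comparing cumulative distributions; the variable names are illustrative.

```python
import numpy as np

levels = ["less than high school", "high school", "associates", "bachelor", "graduate"]
control = np.array([79, 378, 52, 112, 49])   # counts from Table 2
treatment = np.array([2, 46, 11, 65, 41])

cdf_control = np.cumsum(control) / control.sum()
cdf_treatment = np.cumsum(treatment) / treatment.sum()

# Share with a bachelor's degree or higher in each group.
share_control = control[3:].sum() / control.sum()
share_treatment = treatment[3:].sum() / treatment.sum()

# Stochastic dominance: the treated CDF lies at or below the control CDF at every level.
dominates = bool(np.all(cdf_treatment <= cdf_control))
print(round(share_control, 2), round(share_treatment, 2), dominates)
```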
The following specifies a model for the potential outcomes using a latent variable
representation with correlated potential outcomes:

$$Y_i(T) = g(Z_i(T)), \qquad \epsilon_i(T) \sim \mathrm{normal}(0, 1),$$
$$Z_i(T) = \beta T + \epsilon_i(T), \qquad \mathrm{cor}(\epsilon_i(0), \epsilon_i(1)) = \rho.$$

The function $g$ maps the latent variable representation to the ordered non-numeric space and
the parameter $\rho$ captures the dependence among the potential outcomes for an individual.
It is clear that the data contain no information about the parameter $\rho$ since we only observe
one of the potential outcomes for each individual (Imbens and Rubin, 2015). When $\rho = 1$
then the linear relationship between the potential outcomes on the latent scale is exact. On
the other hand $\rho = 0$ suggests that all of the effect is captured by the coefficient $\beta$ on the
latent scale. Other choices of $\rho \in (0, 1)$ describe different degrees of positive dependence.
As in Dasgupta et al. (2014), we treat the correlation parameter $\rho$ as known. We explore
four values of $\rho$: 0.25, 0.50, 0.783 and 1. The third value of $\rho$ is chosen via the following
heuristic argument: it is the Fréchet-Hoeffding upper bound for the correlation of two random
variables with marginals given by the observed potential outcomes. Since the correlation of
two coarsened random variables is necessarily not bigger than the correlation between the
two uncoarsened versions, the choice of the upper bound in our heuristic reflects that the
correlation between the latent potential outcomes is greater than the correlation between
the observed scale potential outcomes.
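
One way to compute a Fréchet-Hoeffding upper bound of this kind is to couple the two marginals comonotonically (through a common uniform quantile) and evaluate the resulting Pearson correlation. The sketch below does this for the Table 2 marginals. Scoring the levels as the integers 1 to 5 is an assumption made here for illustration; the text does not state the scoring used, so the result need not coincide exactly with the reported 0.783.

```python
import numpy as np

def max_correlation(p0, p1, scores):
    """Pearson correlation under the comonotone (upper bound) coupling of two marginals."""
    p0, p1, scores = (np.asarray(a, dtype=float) for a in (p0, p1, scores))
    c0, c1 = np.cumsum(p0), np.cumsum(p1)
    c0[-1] = c1[-1] = 1.0                     # guard against rounding in the last CDF value
    grid = np.unique(np.concatenate(([0.0], c0, c1)))  # breakpoints of the coupling
    mass = np.diff(grid)                      # probability of each quantile interval
    mid = (grid[:-1] + grid[1:]) / 2.0
    y0 = scores[np.searchsorted(c0, mid)]     # quantile function of Y(0) on each interval
    y1 = scores[np.searchsorted(c1, mid)]     # quantile function of Y(1) on each interval
    m0, m1 = mass @ y0, mass @ y1
    cov = mass @ (y0 * y1) - m0 * m1
    return cov / np.sqrt((mass @ y0**2 - m0**2) * (mass @ y1**2 - m1**2))

control = np.array([79, 378, 52, 112, 49])    # counts from Table 2
treatment = np.array([2, 46, 11, 65, 41])
scores = np.arange(1, 6)                      # assumed integer scoring of the levels
rho_max = max_correlation(control / control.sum(), treatment / treatment.sum(), scores)
print(rho_max)
```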

We use the Bayesian approach that employs the rank likelihood described in the previous 
section to obtain the posterior predictive estimates of the estimands of interest. We draw 
50,000 posterior predictive samples for each estimand of interest and report the posterior 
median in Table 3. Intervals are provided where the posterior is not a point. 
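
For a fixed $\rho$, a single posterior predictive draw of the conditional medians can be sketched as follows. Under the bivariate normal error model with unit variances, the missing latent outcome given the observed one is normal with mean $\beta T^{mis} + \rho (Z^{obs} - \beta T^{obs})$ and variance $1 - \rho^2$. The explicit cutoffs used to map latent values to levels, the simulated latent values, and the value of $\beta$ are hypothetical stand-ins for quantities that would come from the MCMC output; this is not the exact implementation behind Table 3.

```python
import numpy as np

def lower_median(values):
    """Middle order statistic, so the result is always an observed level."""
    v = np.sort(values)
    return int(v[(len(v) - 1) // 2])

def conditional_medians(Z_obs, W, beta, rho, cutoffs, rng):
    """One predictive draw of median[Y(1) | Y(0) = j] for each level j that occurs."""
    # Bivariate normal conditioning for the unobserved arm of each unit.
    cond_mean = beta * (1 - W) + rho * (Z_obs - beta * W)
    Z_mis = cond_mean + np.sqrt(1.0 - rho**2) * rng.normal(size=Z_obs.shape)
    Z0 = np.where(W == 0, Z_obs, Z_mis)
    Z1 = np.where(W == 1, Z_obs, Z_mis)
    Y0 = np.searchsorted(cutoffs, Z0) + 1     # hypothetical g with levels 1..k
    Y1 = np.searchsorted(cutoffs, Z1) + 1
    return {int(j): lower_median(Y1[Y0 == j]) for j in np.unique(Y0)}

rng = np.random.default_rng(3)
n, beta = 500, 1.0                            # hypothetical sample size and effect
W = rng.integers(0, 2, size=n)
Z_obs = beta * W + rng.normal(size=n)
cutoffs = np.array([-1.0, 0.0, 1.0, 2.0])     # hypothetical cutoffs, k = 5
draws = {rho: conditional_medians(Z_obs, W, beta, rho, cutoffs, rng)
         for rho in (0.25, 0.5, 1.0)}
```

Repeating this draw across MCMC iterations and summarizing the resulting values for each $j$ would give posterior medians and intervals of the kind reported for each $\rho$.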

The results presented in Table 3 reveal that the analysis is only mildly sensitive to the
choice of $\rho$. The conditional estimates that change the most with $\rho$ correspond to categories
for which there is not much data information (“<high school” and “associates”). For example,
when $\rho = 0.25$ we estimate $\mathrm{median}[Y(1) \mid Y(0) = \text{“<high school”}]$ to be “bachelor”, but the
95% posterior interval includes the estimate based on all other $\rho$ values, “high school”. The
first order conclusion is that there is in fact a treatment effect, which agrees with previous
insights into the effect of parental education on child educational outcomes
(Burnhill et al., 1990). Our method contributes a breakdown of this effect
conditional on the potential outcome under control. In particular, we see that for all
individuals who would have attained at least a high school diploma under control, the effect of
a parental college degree is that they themselves attain at least a college degree.

Table 3: Posterior median estimates of the estimands of interest for the General Social Survey
where 1 = “<high school”, 2 = “high school”, 3 = “associates”, 4 = “bachelor” and 5 = “graduate”.
95% confidence intervals are reported as follows: * = (4,5), † = (2,4), ° = (2,3), ‡ = (3,4);
otherwise the interval is a point.

                 $\mathrm{median}[Y(1) \mid Y(0) = j]$
$\rho$       j = 1    j = 2    j = 3    j = 4    j = 5
0.25         4†       4        4*       4*       5*
0.50         2°       4        4*       5*       5
0.783        2        4        5*       5        5
1            2        4‡       5        5        5


7 Concluding remarks 

In this article we described technical difficulties that arise in causal analyses when the
potential outcomes take ordinal non-numeric values, and proposed solutions.

In Section 2, we proposed a class of multidimensional estimands that depend on the
distribution of the potential outcomes under treatment conditional on those under control.
These estimands, of the form $\mathrm{median}[Y(1) \mid Y(0) = j]$, are especially useful when
experiments are conducted with the goal of planning future interventions. For example, in a
health experiment where outcomes attempt to measure happiness and can take levels of
“sad” and “happy”, a practitioner can decide to assign an antidepressant to individuals only
if $\mathrm{median}[Y(1) \mid Y(0) = \text{“sad”}] = \text{“happy”}$. One can imagine such a prescription would be
made conditional on covariates. Additionally, we described why classical one dimensional
estimands, such as the average treatment effect and the difference in medians between
treatment and control, are inappropriate for ordinal non-numeric data. One dimensional
estimands that are appropriate must be scale-free. When one dimensional estimands are of
interest, we recommend using a measure of distance between the two marginal distributions
of the potential outcomes when testing the sharp null hypothesis of no effect in Section 3.1.
We also introduced a more general testing framework that relaxes the sharp null of constant
non-zero effect to accommodate scale-free data, in Section 3.2.

We also discussed an additional class of estimands based on potential outcomes on a
latent scale, in Section 2.2. Estimands defined on the latent scale generally suffer from
non-identifiability issues due to the mapping from the observed to the latent scales. As such,
we caution practitioners when using them in applications. Nonetheless, a latent variable
formulation does allow us to develop a Bayesian estimation framework for providing posterior
predictive estimates of the induced estimands on the observed scale. We demonstrate this
procedure using an ordered probit as well as a rank likelihood, in Section 5, and illustrate
the proposed methods with an application to educational outcomes in the General Social
Survey, in Section 6.